bregman divergence
Foundations of Top-k Decoding for Language Models
Top-kdecoding is a widely used method for sampling from LLMs: at each token, only the largest k next-token-probabilities are kept, and the next token is sampled after renormalizing them to sum to unity. Top-kand other sampling methods are motivated by the intuition that true next-token distributions are sparse, and the noisy LLM probabilities need to be truncated. However, to our knowledge, a precise theoretical motivation for the use of top-k decoding is missing. In this work, we develop a theoretical framework that both explains and generalizes top-k decoding. We view decoding at a fixed token as the recovery of a sparse probability distribution. We introduce Bregman decoders obtained by minimizing a separable Bregman divergence (for both the primal and dual cases) with a sparsity-inducing โ0-regularization; in particular, these decoders are adaptive in the sense that the sparsity parameter k is chosen depending on the underlying token distribution. Despite the combinatorial nature of the sparse Bregman objective, we show how to optimize it efficiently for a large class of divergences. We prove that (i) the optimal decoding strategies are greedy, and further that (ii) the objective is discretely convex in k, such that the optimal k can be identified in logarithmic time. We note that standard top-k decoding arises as a special case for the KL divergence, and construct new decoding strategies with substantially different behaviors (e.g., non-linearly up-weighting larger probabilities after renormalization).
High-Dimensional Calibration from Swap Regret
We study the online calibration of multi-dimensional forecasts over an arbitrary convex set P Rd relative to an arbitrary norm k k. We connect this with the problem of external regret minimization for online linear optimization, showing that if it is possible to guarantee O( ฯT) worst-case regret after T rounds when actions are drawn from P and losses are drawn from the dual k k unit norm ball, then it is also possible to obtain -calibrated forecasts after T = exp(O(ฯ/2)) rounds.
Latent Process Generator Matching
Billera, Lukas, Nordlinder, Hedwig Nora, Murrell, Ben
A related situation arises when an auxiliary process is introduced to aid training but modelling its dynamics at generation time is unnecessary or difficult, as in Billera et al. [2025b] and Kim et al. [2025]. In each of these works, the projection result and its associated loss are derived on a case-by-case basis, and all theorems are restricted to marginalization over a discrete component of the extended state space. We introduce a general framework that removes these restrictions: given a time-inhomogeneous Feller process (Yt)0 t 1 on an arbitrary state space Y and a map ฮฆ: Y X, one may learn a linear parametrisation of the generator of a Feller process on X whose one-time marginals coincide with those of (ฮฆ(Yt))0 t 1. For Y = X Z and ฮฆthe projection onto the first coordinate, this subsumes these prior works as special cases, allowing for a general class of latent processes (Zt)0 t 1 in a nearly arbitrary state space Z, using the formalism of generator matching to allow for continuous, discrete, or manifold-valued processes. In particular, the learnt process at t = 1 samples from the distribution of ฮฆ(Y1), which is the desired data distribution. We give sufficient conditions for a loss function to be valid in this general setting, recovering the results of the works cited above as corollaries. This result has broad applicability, enabling the construction of a wide array of new flow matching schemes by allowing for a more general class of latent spaces. As a concrete new application, we outline a non-projection ฮฆ: Y X with manifold-valued latents for protein structure generation that separates chain-level rigid-body motion from internal flexibility ( 4), where the particular chain-level versus residue-level or internal state is latent, and the model only sees the world state, which we plan to implement in future work. 2 EARLIERWORK Several recent generative models train with the aid of a latent stochastic process that is marginalised out at generation time.
Calibeating for general proper losses: A Bregman divergence approach
Fichtl, Maximilian, Guzmรกn, Cristรณbal, Mehta, Nishant A.
This work introduces a general framework for calibeating based on regret minimization. As compared to Foster and Hart's seminal calibeating work which had specialized treatments of Brier score (squared loss) and log loss, we consider a large family of proper losses that includes $ฮฑ$-Tsallis losses (for $ฮฑ\in [1, 2]$) and Lipschitz losses. Our results for Tsallis losses also hold for an unscaled version of Tsallis loss that recovers log loss. Our analysis is oriented around the Bregman divergence view of a proper loss. Technically, our results for the family of Tsallis losses that we consider are U-calibration results, simultaneously obtaining logarithmic regret for all losses in this family while having a weaker dependence on the dimension compared to previous results. Of potential independent interest, we also show a new regret equality for the regret of Be The Regularized Leader. This regret equality holds for general proper losses and itself is based on two results related to online updating formulas for the generalized variance, the latter being a previously introduced generalization of variance based on Bregman divergences.
Generalized Distributional Alignment Games for Unbiased Answer-Level Fine-Tuning
Mohri, Mehryar, Schneider, Jon, Zhong, Yutao
The Distributional Alignment Game framework provides a powerful variational perspective on Answer-Level Fine-Tuning (ALFT). However, standard algorithms for these games rely on estimating logarithmic rewards from small batches, introducing a systematic bias due to Jensen's inequality that can destabilize training. In this paper, we systematically resolve this structural estimation bias. First, we generalize the alignment game to arbitrary Bregman divergences, showing that for a family of geometries inducing polynomial rewards, we can construct provably exact and unbiased estimators using U-statistics. Second, for the canonical KL divergence game where an exact solution is impossible, we derive a globally robust minimax polynomial estimator that is provably optimal, achieving the fundamental statistical error limit of $ฮ(1/K^2)$, which we establish via the Ditzian-Totik theorem. Finally, we synthesize these two approaches to propose a novel Variance-Optimal Augmented Polynomial Optimization Program (AQP) Estimator, proving that by systematically reducing variance, our method achieves not only optimal bias but also provably accelerated game convergence, leading to more efficient and stable training with zero online computational overhead.
Clustering with Bregman Divergences: an Asymptotic Analysis
Clustering, in particular k-means clustering, is a central topic in data analysis. Clustering with Bregman divergences is a recently proposed generalization of k-means clustering which has already been widely used in applications. In this paper we analyze theoretical properties of Bregman clustering when the number of the clusters k is large. We establish quantization rates and describe the limiting distribution of the centers as k, extending well-known results for k-means clustering.
Learning Bregman Divergences with Application to Robustness
We propose a novel and general method to learn Bregman divergences from raw high-dimensional data that measure similarity between images in pixel space. As a prototypical application, we learn divergences that consider real-world corruptions of images (e.g., blur) as close to the original and noisy perturbations as far, even if in $L^p$-distance the opposite holds. We also show that the learned Bregman divergence excels on datasets of human perceptual similarity judgment, suggesting its utility in a range of applications. We then define adversarial attacks by replacing the projected gradient descent (PGD) with the mirror descent associated with the learned Bregman divergence, and use them to improve the state-of-the-art in robustness through adversarial training for common image corruptions. In particular, for the contrast corruption that was found problematic in prior work we achieve an accuracy that exceeds the $L^p$- and the LPIPS-based adversarially trained neural networks by a margin of 27.16\% on the CIFAR-10-C corruption data set.
A scaled Bregman theorem with applications
Bregman divergences play a central role in the design and analysis of a range of machine learning algorithms through a handful of popular theorems. We present a new theorem which shows that ``Bregman distortions'' (employing a potentially non-convex generator) may be exactly re-written as a scaled Bregman divergence computed over transformed data. This property can be viewed from the standpoints of geometry (a scaled isometry with adaptive metrics) or convex optimization (relating generalized perspective transforms). Admissible distortions include {geodesic distances} on curved manifolds and projections or gauge-normalisation. Our theorem allows one to leverage to the wealth and convenience of Bregman divergences when analysing algorithms relying on the aforementioned Bregman distortions. We illustrate this with three novel applications of our theorem: a reduction from multi-class density ratio to class-probability estimation, a new adaptive projection free yet norm-enforcing dual norm mirror descent algorithm, and a reduction from clustering on flat manifolds to clustering on curved manifolds. Experiments on each of these domains validate the analyses and suggest that the scaled Bregman theorem might be a worthy addition to the popular handful of Bregman divergence properties that have been pervasive in machine learning.
Representation Learning of Compositional Data
We consider the problem of learning a low dimensional representation for compositional data. Compositional data consists of a collection of nonnegative data that sum to a constant value. Since the parts of the collection are statistically dependent, many standard tools cannot be directly applied. Instead, compositional data must be first transformed before analysis. Focusing on principal component analysis (PCA), we propose an approach that allows low dimensional representation learning directly from the original data.